class: center, middle, inverse, title-slide

# t-tests and their applications
🍵

### S. Mason Garrison

---
layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Methods in Psychological Research</a>
</span>
</div>

---
class: middle

# t-tests and their applications

---

## Roadmap: Last Week

.large[
- p-values
- Hypothesis Testing
- Experimental Designs
]

---

## Roadmap: This Week

.large[
- t-test logic
- one sample t-tests
- two sample t-tests
- Interrogating Null Effects
]

---
class: middle

# t-tests

---

## Context

- We studied the characteristics of the `\(Z\)`-statistic
  - for comparing a mean with a null-hypothesized value.

--

- In the process, we learned a number of general principles about
  - hypothesis testing,
  - power, and the factors affecting power, and
  - the sample size `\(n\)` needed to achieve it.

---

## Z-Test to T-Test

- The Z statistic itself is virtually never used in practice.

--

- .hand-blue[Why?]

--

- Because the population standard deviation `\(\sigma\)` is not known,
  - any more than the population mean is.

--

- .hand-blue[So, what do we do?]

---

## Solution?

- We could substitute an estimate of `\(\sigma\)` (*i.e.*, `\(\hat{\sigma}\)`)
  - in place of `\(\sigma\)`.

--

- Could we replace it with...
  - the sample standard deviation (*s*)?

--

- Thus creating a modified `\(Z\)` statistic?

--

<br><br>
.center.large[
`\(Z_{modified} = \frac{M- \mu_{0} }{s/\sqrt{n}}\)`
]

---

## Thinking Intuitively

- The original `\(Z\)` statistic modified M by subtracting a constant,
  - then dividing by a constant.
- The only thing in the `\(Z\)` statistic that would vary over repeated samples is
  - M, the sample mean.
- This means that the distribution of `\(Z\)` has to be the
  - same shape as the distribution of M.

---

## Thinking Intuitively

.pull-left[
- The modified `\(Z\)` statistic has a
  - sample quantity in its denominator.
- That quantity varies over repeated samples along with M.
- So now, instead of only one thing varying,
  - you have two.
- It turns out that, as `\(n\)` gets larger and larger,
  - these modifications matter less and less,
  - because `\(s\)` starts acting more and more like the constant that it is estimating.
]

.center.pull-right.large[
<br> <br> <br>
`\(Z_{modified} = \frac{M- \mu_{0} }{s/\sqrt{n}}\)`
]

---

## Thinking Intuitively

- In fact, it was known back around 1900 that, as `\(n\)` goes to infinity,
  - the modified `\(Z\)` statistic's distribution got closer and closer to
  - the distribution of the original `\(Z\)` statistic.
- What people didn't know was...
  - how to characterize the performance of the modified `\(Z\)` statistic at small sample sizes.

---

## Student's T

.pull-left[
- W.S. Gosset was a statistician working for the Guinness brewery
  - when he derived the exact distribution of `\(Z_{modified}\)` under some specific conditions.
- This development was considered something of a landmark by the statistical community.
- However, due to confidentiality concerns and conflicts of interest,
  - Gosset published his work under the pen name of "Student".
]

.pull-right[
<img src="data:image/png;base64,#../img/William_Sealy_Gosset.jpg" width="65%" style="display: block; margin: auto;" />
.small[
- [source: wikimedia](https://commons.wikimedia.org/wiki/File:William_Sealy_Gosset.jpg)]
]

---

## Student's T

.pull-left[
- Thus, the modified `\(Z\)` statistic became known as
  - “Student's t statistic” in his honor.
- The distribution of the statistic became known as
  - “Student’s t distribution.”
- This statistic and distribution have many applications
  - beyond what we are reviewing here.
]

.pull-right[
<img src="data:image/png;base64,#../img/William_Sealy_Gosset.jpg" width="65%" style="display: block; margin: auto;" />
.small[
- [source: wikimedia](https://commons.wikimedia.org/wiki/File:William_Sealy_Gosset.jpg)]
]

---

## t-distribution facts

.pull-left[
- The t-distribution varies as a function of:
  - degrees of freedom (df)
- For the sake of simplicity, df will be defined as one less than the number of observations in the sample.
  - df = n - 1 (n: sample size)

.center.large[
`\(t_{df} = \frac{z}{ \sqrt{\frac{ \chi^{2}_{df}}{df}}}\)`
]
]

.pull-right[
<img src="data:image/png;base64,#01_roadmap_files/figure-html/ztplot-1.png" width="90%" style="display: block; margin: auto;" />
]

---

<img src="data:image/png;base64,#01_roadmap_files/figure-html/ztplot-1.png" width="90%" style="display: block; margin: auto;" />

---

## Z vs T

<br>

| | Z-Distribution | T-distribution |
|---|---|---|
| Do we know the population variance? | Yes | No. We substitute the sample variance `\(s^{2}\)` for `\(\sigma^{2}\)` |
| Shape | Bell-shaped and symmetric | Bell-shaped and symmetric, but with fatter tails than the z-distribution |
| Mean | 0 | 0 |
| Variance | 1 | df/(df-2): proportionally larger than z (for df > 2) |
| Score | `\(Z = \frac{M- \mu_{0} }{\sigma/\sqrt{n}}\)` | `\(t = \frac{M- \mu_{0} }{s/\sqrt{n}}\)` |

---
class: middle

# Wrapping Up...

---
class: middle

# Three t-test Applications

---

## Three t-test Applications

- One sample t-test
  - Used when we want to know whether a sample we collected comes from a particular population with unknown mean `\(\mu\)`.
  - (similar to what we have done with the z-test so far)

--

- Matched pair t-test
  - Used when the two samples of data are related or provided by the same participants
  - (*e.g.*, pre- and post-test)

--

- Independent sample t-test
  - Tests the difference between the means of two independent groups
  - (*e.g.*, treatment and control group)

---

## General Procedure

.large[
1. Decide what type of test we want to use
2. Decide what the null and alternative hypotheses are
3. List what we have
4. Compute the t-statistic
5. Find the critical value in the t-table
6. Compare the t-statistic to t*
    - (we can also calculate p and compare it to `\(\alpha\)`)
7. Make a decision: reject the null or not, and draw a conclusion
]

---

## One Sample t-test

- Used when we want to know whether a sample we collected comes from a particular population with unknown mean `\(\mu\)`.

--

- Example:
  - We had a group of 28 students take a reading quiz without having seen the passage on which the questions were based (sample mean: 46.21, sample sd: 6.73).
  - If the students had really guessed blindly, without even looking at the possible answers,
  - we would expect them to get 20 items correct (out of 100) by chance.

--

- Question:
  - Did the students guess by chance?

---

## Workflow

- `1`. Decide what type of test we want to use
  - We don't know the population sd
    - ∴ t-test
  - We only have one sample, and we want to know whether it is from a particular population.
    - ∴ one sample t-test
- `2`. Decide what the null and alternative hypotheses are
  - Null: students are guessing. `\(H_{0}: \mu=20\)`
  - Alternative: students are not guessing. `\(H_{1}: \mu\ne20\)`

---

## Workflow

- `3`. List what we have
  - `\(\bar{x}= 46.21\)`
  - `\(\mu_{0}=20\)`
  - `\(N=28\)`
  - `\(s=6.73\)`
- `4`. Compute the t-statistic
  - `\(t = \frac{M- \mu_{0} }{s/\sqrt{n}}\)` =
  - (46.21−20)/(6.73/ `\(\sqrt{28}\)`) = 20.61

---

## Workflow

- `5`.
Find the critical value in the t-table
  - df = n-1 = 27
  - Because we specified the test as two-tailed,
  - the upper-tail probability is 0.05/2 = 0.025
  - t* = 2.052
- `6`. Compare the t-statistic to t* (we can also calculate p and compare it to `\(\alpha\)`)
  - 20.61 > 2.052
- `7`. Make a decision: reject the null or not, and draw a conclusion
  - Based on these results, we reject the null hypothesis: the students were not guessing.

---

## One-sample t-test: does mean = X?

- e.g. Question: Published data suggest that the microarray failure rate for a particular supplier is 2.1%
- **The Genomics Core wants to know: does this hold true in their own lab?**

---

## One-sample t-test: does mean = X?

- Null hypothesis, `\(H_0\)`:
  + Mean monthly failure rate = 2.1%
- Alternative hypothesis, `\(H_1\)`:
  + Mean monthly failure rate `\(\ne\)` 2.1%
- Tails: *two-tailed*
- Either *reject* or *do not reject* the null hypothesis

<!-- - ***Never accept the alternative hypothesis*** -->

---

## One sample t-test; the data

.small.pull-left[
|Month | Monthly.failure.rate|
|:---------|--------------------:|
|January | 2.90|
|February | 2.99|
|March | 2.48|
|April | 1.48|
|May | 2.71|
|June | 4.17|
|July | 3.74|
|August | 3.04|
|September | 1.23|
|October | 2.72|
|November | 3.23|
|December | 3.40|
]

.pull-right[
- mean = `\((2.9 + \dots + 3.40) / 12\)` = 2.841
- Standard deviation = 0.837
- Hypothesized Mean = 2.1
]

---

## One-sample t-test; key assumptions

.pull-left-narrow[
- Observations are independent
- Observations are normally distributed
]

.pull-right-wide[
<img src="data:image/png;base64,#01_roadmap_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## One-sample t-test; results

.pull-left[
- Test statistic: `\(t_{n-1} = t_{11} = \frac{\bar{x} - \mu_0}{s.d.
/ \sqrt{n}}\)` `\(= \frac{2.84 - 2.10}{s.e.(\bar{x})} =\)` 3.065
]

.pull-right[
<img src="data:image/png;base64,#01_roadmap_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## One-sample t-test; results

<img src="data:image/png;base64,#01_roadmap_files/figure-html/unnamed-chunk-7-1.png" width="70%" style="display: block; margin: auto;" />

---

## One-sample t-test; results

- Test statistic: `\(t_{n-1} = t_{11} = \frac{\bar{x} - \mu_0} {s.d. / \sqrt{n}} = \frac{2.84 - 2.10}{s.e.(\bar{x})} =\)` 3.065
- df = 11
- p = 0.01
- ***Reject*** `\(H_0\)`
- Evidence that the mean monthly failure rate `\(\ne\)` 2.1%

---

## One-sample t-test; results

- The mean monthly failure rate of microarrays in the Genomics Core is 2.84 (95% CI: 2.30, 3.37).
- It is not equal to the company's hypothesized mean of 2.1.
- t = 3.07, df = 11, p = 0.01

---

## 3rd Example

- We wanted to test whether the volume of a shipment of lumber is less than usual:
  - `\(\mu_{0} = 39000\)` cubic feet

.pull-left-narrow[
- Classic R syntax
  - t.test(x, mu = 0)
  - where x is the name of our variable of interest, and
  - mu is set equal to the mean specified by the null hypothesis.
  - (note: `t.test` is two-sided by default; add `alternative = "less"` for a directional test)
]

.pull-right-wide[

```r
set.seed(0)
treeVolume <- c(rnorm(75, mean = 36500, sd = 2000))
t.test(treeVolume, mu = 39000) # Ho: mu = 39000
```
]

.center.footnote[Source Code: https://datascienceplus.com/t-tests/]

---

## Output


```
## 
## 	One Sample t-test
## 
## data:  treeVolume
## t = -12.288, df = 74, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 39000
## 95 percent confidence interval:
##  36033.60 36861.38
## sample estimates:
## mean of x 
##  36447.49
```

---
class: middle

# Wrapping Up...

<!--
- source code: https://ggplot2tutor.com/tutorials/sampling_distributions
- source code: https://github.com/bioinformatics-core-shared-training/IntroductionToStats
# References
-->
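
---

## Appendix: Checking our work in R

The two worked one-sample examples can be reproduced in base R. This sketch is not from the original materials; the object names (`t_stat`, `failure_rate`, etc.) are our own, and the inputs are the summary statistics and raw data quoted on the slides.


```r
# Reading-quiz example, from summary statistics
M <- 46.21; s <- 6.73; n <- 28; mu0 <- 20
t_stat <- (M - mu0) / (s / sqrt(n))   # about 20.61
t_crit <- qt(0.975, df = n - 1)       # about 2.05 (two-tailed, alpha = .05)
t_stat > t_crit                       # exceeds t*, so we reject H0

# Microarray example, from the raw monthly failure rates
failure_rate <- c(2.90, 2.99, 2.48, 1.48, 2.71, 4.17,
                  3.74, 3.04, 1.23, 2.72, 3.23, 3.40)
t.test(failure_rate, mu = 2.1)        # t = 3.06, df = 11, p ~ .011
```

Working from raw data (as in the microarray example) lets `t.test()` do every step of the General Procedure at once; with only summary statistics (as in the quiz example), we compute t and the critical value by hand.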